PandaLM: An Automatic Evaluation Benchmark for LLM Instruction Tuning Optimization
https://arxiv.org/abs/2306.05087
The paper reports that PandaLM-7B achieves 93.75% of GPT-3.5's evaluation ability and 88.28% of GPT-4's, measured by F1-score on the authors' human-annotated test set (a sketch of how such a ratio is computed follows below).
https://github.com/WeOpenML/PandaLM
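The relative "evaluation ability" figures above are ratios of F1-scores against human gold labels. A minimal sketch of how such a ratio could be reproduced from three-way (win/lose/tie) pairwise judgments, assuming macro-averaged F1; the label arrays here are hypothetical placeholders, not data from the paper:

```python
from sklearn.metrics import f1_score

# Three-way judgment labels: 0 = response 1 wins, 1 = response 2 wins, 2 = tie.
human   = [0, 1, 2, 0, 1, 2, 0, 0]  # hypothetical human gold annotations
pandalm = [0, 1, 2, 0, 1, 0, 0, 1]  # hypothetical PandaLM-7B judgments
gpt35   = [0, 1, 2, 0, 1, 2, 0, 1]  # hypothetical GPT-3.5 judgments

# Macro-averaged F1 of each judge against the human labels (an assumption;
# the paper may average differently).
f1_pandalm = f1_score(human, pandalm, average="macro")
f1_gpt35   = f1_score(human, gpt35, average="macro")

# "X% of GPT-3.5's evaluation ability" as a ratio of F1-scores.
print(f"PandaLM reaches {100 * f1_pandalm / f1_gpt35:.2f}% of GPT-3.5's F1")
```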